Deadline: Mar 13th, 23:00
Academic Integrity
This project is individual - it is to be completed on your own. If you have questions, please post your query in the APS1070 Piazza Q&A forums (the answer might be useful to others!).
Do not share your code with others, or post your work online. Do not submit code that you have not written yourself. Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.
In this project, we work with a COVID-19 vaccination dataset reporting, for different countries, the totals at the end of each day.
Please fill out the following:
Download your notebook: File -> Download .ipynb
Click on the Files icon on the far left menu of Colab
Select & upload the .ipynb file you just downloaded, then obtain its path (right-click the file; you might need to hit the Refresh button before it shows up)
Execute the following in a Colab cell:
%%shell
jupyter nbconvert --to html /PATH/TO/YOUR/NOTEBOOKFILE.ipynb
An HTML version of your notebook will appear in the files, so you can download it.
Submit both HTML and IPYNB files on Quercus for grading.
import pandas as pd
cases_raw = pd.read_csv(
    filepath_or_buffer='https://raw.githubusercontent.com/APS1070-W23/datasets/main/country_vaccinations_wide.csv',
    index_col=0
)
Write a function to do the following: [0.25]
Apply StandardScaler to the data. Each day should have a mean of zero and a standard deviation of 1. [0.25]
Run the function in step 1 on the standardized dataset for the United States, China, and Canada. [0.25]
Discuss the trends in the standardized time series for the United States, Canada, and China. What does it mean if the curve goes up or down (are the numbers of vaccinations negative)? What does the sign of the values indicate? [0.25]
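As a reminder of what StandardScaler computes (our note, not part of the handout): for day $j$ with mean $\mu_j$ and standard deviation $\sigma_j$ taken across countries, each entry is mapped to $$ z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, $$ so every date column ends up with zero mean and unit variance.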
cases_raw.columns = pd.to_datetime(cases_raw.columns)
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
def explore_dataset(dataset: pd.DataFrame) -> None:
    '''
    Plots each row of the dataframe as a time series, 20 subplots per figure.
    '''
    len_dataset = len(dataset)
    # Number of 4x5 figures needed to cover every row.
    num_tables = len_dataset // 20
    if num_tables * 20 < len_dataset:
        num_tables += 1
    for table_num in range(num_tables):
        fig, axes = plt.subplots(4, 5, figsize=(18, 10))
        axes = axes.flatten()
        curr_table = dataset[table_num * 20: min((table_num + 1) * 20, len_dataset)]
        for i, country in enumerate(curr_table.index):
            ax = axes[i]
            ax.plot(dataset.columns, dataset.loc[country], 'b-')
            ax.title.set_text(country)
            xmin, xmax = ax.get_xlim()
            ax.set_xticks(np.round(np.linspace(xmin, xmax, 3), 2))
        # Drop any unused axes in the final, possibly partial, figure.
        if table_num == num_tables - 1:
            for j in range(i + 1, 20):
                fig.delaxes(axes[j])
        fig.tight_layout()
        plt.show()
explore_dataset(cases_raw)
from sklearn.preprocessing import StandardScaler
from typing import Tuple
def standardize(dataframe: pd.DataFrame) -> Tuple[pd.DataFrame, StandardScaler]:
    '''Standardizes each column (date) to zero mean and unit variance.'''
    scaler = StandardScaler()
    cases_standardized = pd.DataFrame(index=dataframe.index, columns=dataframe.columns)
    cases_standardized[cases_standardized.columns] = scaler.fit_transform(dataframe)
    return cases_standardized, scaler
cases_std, scaler = standardize(cases_raw)
US_CAN_CHINA = cases_std.loc[['United States', 'Canada', 'China']]
explore_dataset(US_CAN_CHINA)
Since each feature (date) is standardized to zero mean and unit variance, a country's value on a given day is relative to the global mean for that day: a negative score indicates that the country's vaccination count was below the global average for that day, and a positive score indicates it was above it. The US plateaus at around the global mean daily count after 2021-07-31, whereas Canada and China remain almost consistently above average.
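As a quick sanity check (a sketch of our own, assuming cases_std from the cells above), every date column should now have mean zero and unit variance; note that StandardScaler uses the population standard deviation (ddof=0):

vals = cases_std.to_numpy(dtype=float)
print(np.allclose(vals.mean(axis=0), 0))  # per-day means ~ 0
# Per-day stds should be ~1; a constant (e.g. all-zero) day would remain 0,
# since StandardScaler leaves zero-variance columns unscaled.
print(np.allclose(vals.std(axis=0, ddof=0), 1))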
Write a function get_sorted_eigen(df_cov) that gets the covariance matrix of dataframe df (from step 1) and returns sorted eigenvalues and eigenvectors using np.linalg.eigh. [0.25]

def get_cov_mat(data: np.ndarray) -> np.ndarray:
    '''
    Gets the covariance matrix of an array of standardized data.
    Assumes the columns already have zero mean, so no centering is applied.
    '''
    n, _ = data.shape
    return np.dot(data.T, data) / (n - 1)
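A small check (ours, not required by the handout): because the standardized columns already have zero mean, get_cov_mat should agree with NumPy's built-in covariance, which centers the columns itself:

X = cases_std.to_numpy(dtype=float)
print(np.allclose(get_cov_mat(X), np.cov(X, rowvar=False)))  # expect True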
from typing import List

def get_sorted_eigen(covariance_mat: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    '''Returns eigenvalues in decreasing order and their eigenvectors as columns.'''
    eigenValues, eigenVectors = np.linalg.eigh(covariance_mat)
    # eigh returns eigenvalues in ascending order, so re-sort in descending order.
    args = (-eigenValues).argsort()
    eigenValues = eigenValues[args]
    eigenVectors = eigenVectors[:, args]
    return eigenValues, eigenVectors
cases_cov = get_cov_mat(cases_std)
eigenValues, eigenVectors = get_sorted_eigen(cases_cov)
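A brief verification sketch (ours): the eigenvalues should come out in decreasing order, and each sorted column of eigenVectors should still satisfy $C v_i = \lambda_i v_i$:

print(np.all(np.diff(eigenValues) <= 1e-9))  # non-increasing order
# C @ V equals V with each column scaled by its eigenvalue.
print(np.allclose(cases_cov @ eigenVectors, eigenVectors * eigenValues))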
def get_explained_variances(eigenvalues: np.ndarray) -> np.ndarray:
    eigValSum = sum(eigenvalues)
    expVar = [eigV / eigValSum * 100 for eigV in eigenvalues]
    cumExpVar = np.cumsum(expVar)
    # Zero-based index of the first PC at which cumulative variance reaches 99%.
    index_of_99 = (cumExpVar[:10] >= 99).tolist().index(True)
    print(index_of_99)
    plt.bar(range(1, 11), expVar[:10], label='Explained Variance')
    plt.plot(range(1, index_of_99 + 1), cumExpVar[:index_of_99], 'r-o', label='Cumulative Explained Variance (< 99%)')
    plt.plot(range(index_of_99 + 1, 11), cumExpVar[index_of_99: 10], 'g-o', label='Cumulative Explained Variance (>= 99%)')
    plt.plot([1, 10], [99, 99], 'g:', label='99% Cumulative Explained Variance')
    plt.legend(loc='center right')
    plt.xlabel("Principal Component")
    plt.ylabel("Percent Cumulative Variance")
    plt.title("Scree plot of the first 10 PCs")
    plt.show()
    return cumExpVar
cumExpVar = get_explained_variances(eigenValues)
7
Let $x$ be the minimum number of principal components needed for $\geq 99\%$ cumulative explained variance, and let $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_N$ be the sorted eigenvalues of the covariance matrix. Then $$ x = \min \left\{ i : \frac{\sum_{j = 1}^{i} \lambda_j}{\sum_{j = 1}^{N} \lambda_j} \geq 0.99 \right\}. $$ In this case, $x = 8$ (the printed 7 above is the zero-based index).
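The same threshold can be read off programmatically (a one-line sketch using the cumExpVar returned above):

x = int(np.argmax(cumExpVar >= 99)) + 1  # convert the zero-based index to a count
print(x)  # 8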
def plot_16_eigenvectors(eigenVectors: np.ndarray, features) -> None:
    fig, axes = plt.subplots(4, 4, figsize=(16, 16))
    axes = axes.flatten()
    # Eigenvectors are the *columns* of eigenVectors, so transpose before slicing.
    for i, (ev, ax) in enumerate(zip(eigenVectors.T[:16], axes)):
        ax.plot(features, ev, 'b-')
        ax.title.set_text("PC " + str(i + 1))
        xmin, xmax = ax.get_xlim()
        ax.set_xticks(np.round(np.linspace(xmin, xmax, 3), 2))
    fig.tight_layout()
    plt.show()
plot_16_eigenvectors(eigenVectors, cases_raw.columns)
Create a function that:
Accepts a country and the original dataset as inputs.
Calls useful functions that you designed in previous parts to compute eigenvectors and eigenvalues.
Plots 4 figures:
One of these shows the residual error (df - df_reconstructed): dates on the x-axis and the residual error on the y-axis (scaler.inverse_transform helps map reconstructions back to the original scale for comparison).

Test your function using the United States, Canada, and China as inputs. [0.5]
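For reference, the reconstruction implemented below projects a country's standardized row $x$ onto the first $k$ sorted eigenvectors (stacked as the columns of $W$) and back, $$ \hat{x} = x W W^{T}, $$ after which scaler.inverse_transform maps $\hat{x}$ back to the original scale.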
from sklearn.metrics import mean_squared_error
def plot_country_figures_PCA(original_df: pd.DataFrame, country_name: str, measurement: str = 'Cases') -> Tuple[List[Tuple[int, np.ndarray]], pd.DataFrame]:
    country_data = original_df.loc[[country_name]]
    # Country time series data
    fig, axes = plt.subplots(1, 1)
    axes.plot(country_data.columns, country_data.values[0], 'b-')
    xmin, xmax = axes.get_xlim()
    axes.set_xticks(np.round(np.linspace(xmin, xmax, 3), 2))
    axes.title.set_text(country_name)
    fig.supxlabel("Date")
    fig.supylabel(measurement)
    plt.show()
    plt.close()
    # Incremental reconstruction from the first 1, 2, 4, 8, and 16 PCs
    fig, axes = plt.subplots(1, 5, figsize=(18, 4))
    cases_std, scaler = standardize(original_df)
    country_std = cases_std.loc[[country_name]]
    cases_cov = get_cov_mat(cases_std)
    eigenvalues, eigenvectors = get_sorted_eigen(cases_cov)
    eigenvectors = np.array(eigenvectors)
    PCs = eigenvectors[:, :16]
    recons = []
    for recon_num in range(5):
        W = PCs[:, :2 ** recon_num]
        # Project onto the leading PCs and back, then undo the standardization.
        recon = (country_std @ W) @ W.T
        data = scaler.inverse_transform(recon)[0]
        recons.append((2 ** recon_num, data))
        ax_recon = axes[recon_num]
        ax_recon.plot(original_df.columns, data, 'r-')
        ax_recon.title.set_text("First " + str(2 ** recon_num) + " PCs")
        xmin, xmax = ax_recon.get_xlim()
        ax_recon.set_xticks(np.round(np.linspace(xmin, xmax, 3), 2))
    fig.suptitle("Reconstruction from Principal Components (" + country_name + ")")
    fig.supxlabel("Date")
    fig.supylabel(measurement)
    fig.tight_layout()
    plt.show()
    plt.close()
    # Residual errors for each reconstruction
    fig, axes = plt.subplots(1, 1, figsize=(4, 4))
    for i, data in recons:
        axes.plot(original_df.columns, (country_data - data).values[0], label="First " + str(i) + " PCs")
    xmin, xmax = axes.get_xlim()
    axes.set_xticks(np.round(np.linspace(xmin, xmax, 3), 2))
    fig.tight_layout()
    plt.legend()
    plt.title("Residual Errors")
    plt.xlabel("Date")
    plt.ylabel("Residual Error (Actual - Reconstructed)")
    plt.show()
    plt.close()
    # RMSE of reconstructions using 1 to 10 PCs
    rmse = pd.DataFrame(index=list(range(1, 11)), columns=['RMSE'])
    rmse.index.name = 'Number of PCs'
    for num_pcs in range(1, 11):
        W = PCs[:, :num_pcs]
        recon = (country_std @ W) @ W.T
        data = scaler.inverse_transform(recon)[0]
        rmse.loc[num_pcs] = mean_squared_error(y_true=country_data.values[0], y_pred=data, squared=False)
    return recons, rmse
PCA_data = {
    'Canada': plot_country_figures_PCA(cases_raw, "Canada"),
    'United States': plot_country_figures_PCA(cases_raw, "United States"),
    'China': plot_country_figures_PCA(cases_raw, "China")
}
Modify your code in part 3 to use SVD instead of PCA for extracting the eigenvectors. [1]
Explain if standardization or covariance computation is required for this part. Repeat part 3 and compare your PCA and SVD results. Write a function to make this comparison [0.5], and comment on the results. [0.5].
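Before modifying the code, here is a quick sketch (our own check, not part of the handout) of why SVD can replace the explicit covariance route: for standardized data $X$ with $n$ rows, $X = U S V^T$ gives $X^T X / (n - 1) = V \, \mathrm{diag}(s^2 / (n - 1)) \, V^T$, so the right singular vectors are the principal directions.

from scipy.linalg import svd

X = cases_std.to_numpy(dtype=float)
U, s, VT = svd(X, full_matrices=False)
lam, vecs = get_sorted_eigen(get_cov_mat(X))
n = X.shape[0]
k = 10  # compare only the leading components (assumes they are well separated)
print(np.allclose(s[:k] ** 2 / (n - 1), lam[:k]))
# Right singular vectors match the eigenvectors up to sign.
print(np.allclose(np.abs(VT[:k]), np.abs(vecs[:, :k].T)))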
from scipy.linalg import svd

def plot_country_figures_SVD(original_df: pd.DataFrame, country_name: str, measurement: str = 'Cases') -> Tuple[List[Tuple[int, np.ndarray]], pd.DataFrame]:
    country_data = original_df.loc[[country_name]]
    """ # Country time series data (already plotted by the PCA version)
    fig, axes = plt.subplots(1, 1)
    axes.plot(country_data.columns, country_data.values[0], 'b-')
    xmin, xmax = axes.get_xlim()
    axes.set_xticks(np.round(np.linspace(xmin, xmax, 3), 2))
    axes.title.set_text(country_name)
    fig.supxlabel("Date")
    fig.supylabel("Cases")
    plt.show()
    plt.close() """
    # Incremental reconstruction from the first 1, 2, 4, 8, and 16 PCs.
    # SVD decomposes the raw data matrix directly, so neither standardization
    # nor an explicit covariance matrix is needed here.
    fig, axes = plt.subplots(1, 5, figsize=(18, 4))
    U, s, VT = svd(original_df)
    # Rows of VT (the right singular vectors) are the principal directions.
    PCs = VT[:16, :]
    recons = []
    for recon_num in range(5):
        W = PCs[:2 ** recon_num, :].T
        data = country_data @ W @ W.T
        data = data.values[0]
        recons.append((2 ** recon_num, data))
        ax_recon = axes[recon_num]
        ax_recon.plot(original_df.columns, data, 'r-')
        ax_recon.title.set_text("First " + str(2 ** recon_num) + " PCs")
        xmin, xmax = ax_recon.get_xlim()
        ax_recon.set_xticks(np.round(np.linspace(xmin, xmax, 3), 2))
    fig.suptitle("Reconstruction from Principal Components (" + country_name + ")")
    fig.supxlabel("Date")
    fig.supylabel(measurement)
    fig.tight_layout()
    plt.show()
    plt.close()
    # Residual errors for each reconstruction
    fig, axes = plt.subplots(1, 1, figsize=(4, 4))
    for i, data in recons:
        axes.plot(original_df.columns, (country_data - data).values[0], label="First " + str(i) + " PCs")
    xmin, xmax = axes.get_xlim()
    axes.set_xticks(np.round(np.linspace(xmin, xmax, 3), 2))
    fig.tight_layout()
    plt.legend()
    plt.title("Residual Errors")
    plt.xlabel("Date")
    plt.ylabel("Residual Error (Actual - Reconstructed)")
    plt.show()
    plt.close()
    # RMSE of reconstructions using 1 to 10 PCs
    rmse = pd.DataFrame(index=list(range(1, 11)), columns=['RMSE'])
    rmse.index.name = 'Number of PCs'
    for num_pcs in range(1, 11):
        W = PCs[:num_pcs, :].T
        data = (country_data @ W @ W.T).values[0]
        rmse.loc[num_pcs] = mean_squared_error(y_true=country_data.values[0], y_pred=data, squared=False)
    return recons, rmse
SVD_data = {
    'Canada': plot_country_figures_SVD(cases_raw, "Canada"),
    'United States': plot_country_figures_SVD(cases_raw, "United States"),
    'China': plot_country_figures_SVD(cases_raw, "China")
}
def compare_PCA_SVD(PCArmse: pd.DataFrame, SVDrmse: pd.DataFrame) -> pd.DataFrame:
    comparison = pd.DataFrame(index=list(range(1, 11)), columns=['PCA RMSE', 'SVD RMSE', 'Difference (SVD - PCA)'])
    comparison.index.name = 'Number of PCs'
    comparison['PCA RMSE'] = PCArmse.iloc[:, 0]
    comparison['SVD RMSE'] = SVDrmse.iloc[:, 0]
    comparison['Difference (SVD - PCA)'] = SVDrmse.iloc[:, 0] - PCArmse.iloc[:, 0]
    return comparison
compare_PCA_SVD(PCA_data['Canada'][1], SVD_data['Canada'][1])
| Number of PCs | PCA RMSE | SVD RMSE | Difference (SVD - PCA) |
|---|---|---|---|
| 1 | 133789.245364 | 221855.568392 | 88066.323028 |
| 2 | 105134.474703 | 98010.767324 | -7123.707379 |
| 3 | 99057.781785 | 97418.783358 | -1638.998427 |
| 4 | 94259.498423 | 87947.849058 | -6311.649365 |
| 5 | 94288.074446 | 85876.709467 | -8411.364979 |
| 6 | 88251.911049 | 85857.378523 | -2394.532527 |
| 7 | 86360.544149 | 82127.660302 | -4232.883847 |
| 8 | 83808.041194 | 82084.378482 | -1723.662713 |
| 9 | 82515.135196 | 57980.64609 | -24534.489107 |
| 10 | 82924.693613 | 57618.756483 | -25305.93713 |
compare_PCA_SVD(PCA_data['United States'][1], SVD_data['United States'][1])
| Number of PCs | PCA RMSE | SVD RMSE | Difference (SVD - PCA) |
|---|---|---|---|
| 1 | 1338511.649811 | 1417867.352979 | 79355.703168 |
| 2 | 1996272.281708 | 1111788.699566 | -884483.582142 |
| 3 | 1213454.614165 | 1064634.760186 | -148819.853979 |
| 4 | 69242.615681 | 110285.717082 | 41043.101402 |
| 5 | 69187.526774 | 75460.055524 | 6272.52875 |
| 6 | 41610.235426 | 55670.315403 | 14060.079977 |
| 7 | 29142.892932 | 55420.411619 | 26277.518687 |
| 8 | 31250.771374 | 55000.652492 | 23749.881117 |
| 9 | 15079.045827 | 24389.618086 | 9310.572259 |
| 10 | 15752.40748 | 24145.179414 | 8392.771934 |
compare_PCA_SVD(PCA_data['China'][1], SVD_data['China'][1])
| Number of PCs | PCA RMSE | SVD RMSE | Difference (SVD - PCA) |
|---|---|---|---|
| 1 | 8751694.563669 | 8860845.601538 | 109151.037869 |
| 2 | 2052843.950472 | 1342015.162167 | -710828.788305 |
| 3 | 2052699.431026 | 64984.094473 | -1987715.336553 |
| 4 | 27038.817665 | 37560.561122 | 10521.743457 |
| 5 | 27042.323005 | 36283.231364 | 9240.908359 |
| 6 | 26480.12478 | 18166.623604 | -8313.501176 |
| 7 | 22570.944004 | 16428.833096 | -6142.110908 |
| 8 | 22977.657002 | 16258.656346 | -6719.000655 |
| 9 | 12762.959379 | 8422.864214 | -4340.095164 |
| 10 | 12906.962979 | 7902.373816 | -5004.589164 |
For Canada and China, SVD outperformed PCA, with a lower RMSE for the majority of principal-component counts. Neither standardization nor an explicit covariance matrix is required to find the eigenvectors with SVD: decomposing the data matrix directly as $A = U \Sigma V^T$ gives $A^T A = V \Sigma^2 V^T$ and $A A^T = U \Sigma^2 U^T$, so the right and left singular vectors are the eigenvectors of $A^T A$ and $A A^T$, with the squared singular values as the corresponding eigenvalues.
Create another dataset similar to the one provided in your handout using the raw information on COVID deaths (not vaccinations) provided here. [1]
You need to manipulate the data to organize it in the desired format (i.e., the same format that was in previous parts). You are free to use any tools you like, from Excel to Python!
In the end, you should have a new CSV file with more dates (features) compared to the provided dataset.
Upload your new dataset (in CSV format) to your colab notebook and repeat part 4. When analyzing Canada, you may use just Ontario. For China, you may use Beijing. [1]
Don't forget to add your new CSV file to Quercus for reference. The code below helps you upload your new CSV file to your Colab session.
covid_data = pd.read_csv("time-series-19-covid-combined_csv.csv")
# Reshape the long-format rows into the wide format used earlier:
# one row per (country, province), one column per date.
covid_data = covid_data.pivot(index=['Country/Region', 'Province/State'], columns="Date", values="Deaths")
covid_data.to_csv("formatted_covid_data.csv")
# Flatten the (country, province) MultiIndex into "Country_Province" labels;
# `label[1] == label[1]` is False only for NaN, i.e. rows with no province.
covid_data.index = [''.join([label[0], '_' + label[1]] if label[1] == label[1] else [label[0], '']) for label in covid_data.index.values]
covid_data.columns = pd.to_datetime(covid_data.columns)
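A quick sketch to confirm the reshaped deaths data matches the wide layout used in the earlier parts (regions as rows, one datetime column per day) and, as required, has more dates than the handout dataset:

print(covid_data.shape, cases_raw.shape)  # covid_data should have more columns (dates)
print(covid_data.index[:3].tolist())      # labels like 'Country' or 'Country_Province'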
PCA_deaths_data = {
    'Canada_Ontario': plot_country_figures_PCA(covid_data, "Canada_Ontario", measurement='Deaths'),
    'US': plot_country_figures_PCA(covid_data, "US", measurement='Deaths'),
    'China_Beijing': plot_country_figures_PCA(covid_data, "China_Beijing", measurement='Deaths')
}
SVD_deaths_data = {
    'Canada_Ontario': plot_country_figures_SVD(covid_data, "Canada_Ontario", measurement='Deaths'),
    'US': plot_country_figures_SVD(covid_data, "US", measurement='Deaths'),
    'China_Beijing': plot_country_figures_SVD(covid_data, "China_Beijing", measurement='Deaths')
}
compare_PCA_SVD(PCA_deaths_data['Canada_Ontario'][1], SVD_deaths_data['Canada_Ontario'][1])
| Number of PCs | PCA RMSE | SVD RMSE | Difference (SVD - PCA) |
|---|---|---|---|
| 1 | 4384.053366 | 6643.387885 | 2259.334518 |
| 2 | 2084.367043 | 749.34795 | -1335.019093 |
| 3 | 2166.22009 | 429.532015 | -1736.688075 |
| 4 | 1054.076782 | 330.451888 | -723.624894 |
| 5 | 798.12431 | 330.340244 | -467.784066 |
| 6 | 756.546163 | 329.570383 | -426.975781 |
| 7 | 570.667156 | 189.595948 | -381.071208 |
| 8 | 494.499028 | 182.241745 | -312.257283 |
| 9 | 625.049983 | 125.137761 | -499.912222 |
| 10 | 249.746605 | 123.596289 | -126.150315 |
compare_PCA_SVD(PCA_deaths_data['US'][1], SVD_deaths_data['US'][1])
| Number of PCs | PCA RMSE | SVD RMSE | Difference (SVD - PCA) |
|---|---|---|---|
| 1 | 469158.988396 | 479964.284092 | 10805.295696 |
| 2 | 51257.24739 | 40250.960771 | -11006.286619 |
| 3 | 48958.532237 | 11789.137621 | -37169.394616 |
| 4 | 20354.407291 | 3423.079758 | -16931.327533 |
| 5 | 11482.657811 | 1993.179306 | -9489.478504 |
| 6 | 6037.089565 | 1447.863498 | -4589.226067 |
| 7 | 4718.873073 | 1400.633785 | -3318.239288 |
| 8 | 4533.110757 | 1144.070479 | -3389.040278 |
| 9 | 1475.919095 | 865.849372 | -610.069722 |
| 10 | 991.362137 | 642.049244 | -349.312893 |
compare_PCA_SVD(PCA_deaths_data['China_Beijing'][1], SVD_deaths_data['China_Beijing'][1])
| Number of PCs | PCA RMSE | SVD RMSE | Difference (SVD - PCA) |
|---|---|---|---|
| 1 | 10858.706419 | 8.713415 | -10849.993003 |
| 2 | 805.876072 | 4.856634 | -801.019437 |
| 3 | 846.817379 | 3.767875 | -843.049504 |
| 4 | 613.315418 | 3.597618 | -609.717801 |
| 5 | 330.401241 | 2.930725 | -327.470515 |
| 6 | 98.715957 | 2.910794 | -95.805163 |
| 7 | 62.182047 | 1.924482 | -60.257565 |
| 8 | 58.201555 | 1.884536 | -56.317019 |
| 9 | 66.386252 | 1.871746 | -64.514506 |
| 10 | 41.212904 | 1.846639 | -39.366265 |
Here SVD had a lower RMSE than PCA for all three regions over almost every PC count. For Beijing, however, the counts are so far (in terms of variance) from the global mean that neither PCA nor SVD was able to adequately recover the shape of the raw curve.
%%cmd
jupyter nbconvert --to html "C:\Users\nurei\Documents\MIE\APS1070\project\3\W23_APS1070_Project_3.ipynb" --HTMLExporter.theme=dark
Understanding PCA and SVD:
https://towardsdatascience.com/pca-and-svd-explained-with-numpy-5d13b0d2a4d8
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.8-Singular-Value-Decomposition/